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Greedy methods, randomization approaches and 
multi-arm bandit algorithms for efficient 
sparsity-constrained optimization 

A. Rakotomamonjy Member, IEEE, S. K090, L. Ralaivola 


Abstract —Several sparsity-constrained algorithms such as Or¬ 
thogonal Matching Pursuit or the Frank-Wolfe algorithm with 
sparsity constraints work hy iteratively selecting a novel atom 
to add to the current non-zero set of variahles. This selection 
step is usually performed hy computing the gradient and then 
hy looking for the gradient component with maximal absolute 
entry. This step can he computationally expensive especially for 
large-scale and high-dimensional data. In this work, we aim at 
accelerating these sparsity-constrained optimization algorithms 
hy exploiting the key observation that, for these algorithms to 
work, one only needs the coordinate of the gradient’s top entry. 
Hence, we introduce algorithms based on greedy methods and 
randomization approaches that aim at cheaply estimating the 
gradient and its top entry. Another of our contribution is to cast 
the problem of finding the best gradient entry as a best arm 
identification in a multi-armed bandit problem. Owing to this 
novel insight, we are able to provide a bandit-based algorithm 
that directly estimates the top entry in a very efficient way. 
Theoretical observations stating that the resulting inexact Frank- 
Wolfe or Orthogonal Matching Pursuit algorithms act, with high 
probability, similarly to their exact versions are also given. We 
have carried out several experiments showing that the greedy 
deterministic and the bandit approaches we propose can achieve 
an acceleration of an order of magnitude while being as efficient 
as the exact gradient when used in algorithms such as OMP, 
Frank-Wolfe or CoSaMP. 

Index Terms —Sparsity, Orthogonal Matching Pursuit, Frank- 
Wolfe algorithm, Greedy methods, Best arm identification. 

I. Introduction 

Over the last decade, there has been a large interest in 
inference problems featuring data of very high-dimension and 
a small number of observations. Such problems occur in a wide 
variety of application domains ranging from computational 
biology and text mining to information retrieval and finance. 
In order to learn from these datasets, statistical models are 
frequently designed so as to feature some sparsity properties. 
Hence, fueled by the large interest induced by these application 
domains, an important amount of research work in the machine 
learning, statistics and signal processing communities, has 
been devoted to sparse learning and as such, many algorithms 
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have been developed for yielding models that use only few 
dimensions of the data. To obtain these models, one typically 
needs to solve a problem of the form 

minL(w) subject to ||w h<K (1) 

W 

where L(w) is a smooth convex objective function that mea¬ 
sures a goodness of fit of the model, ||w||o is the fg pseudo¬ 
norm that counts the number of non-zero components of the 
vector w and AT is a parameter that controls the sparsity level. 

One usual approach that has been widely considered is the 
use of a convex and continuous surrogate of the £0 pseudo¬ 
norm, namely an £i-norm. The resulting problem is the well- 
known Lasso problem 11 and a large variety of algorithms 
for its resolution exist, ranging from homotopy methods II 
to the Frank-Wolfe (FW) algorithm 0. In the same flavour, 
non-convex continuous penalties are also common solutions 
for relaxing the £0 pseudo-norm ii, Q. Another possible 
approach is to consider greedy methods that provide local 
optimal solution to the problem Q. In this last context, a 
flurry of algorithms have been proposed, the most popular 
ones being the Matching Pursuit (MP) and the Orthogonal 
Matching Pursuit (OMP) algorithms a, 0,0. One common 
point of the aforementioned algorithms for solving problem 
0 is that they require, at each iteration, the computation 
of the objective function’s gradient. For large-scale and high¬ 
dimensional setting, computing the gradient at each iteration 
may be very time-consuming. 

Stochastic gradient descent (SGD) algorithms are now clas¬ 
sical methods for avoiding the computation of the full gradient 
in large-scale learning problems ii, M. Most of these works 
have been devoted to smooth composite optimization although 
some efforts addressing £1-regularized problems exist HD. 
Recently, these SGD algorithms have been further accelerated 
through the introduction of variance reduction methods for 
gradient estimation 02. In the context of problem 0 where 
a non-smooth non-convex constraint appears, very few works 
have envisaged the use of stochastic optimization. Nguyen 
et al. ifTSll have proposed stochastic versions of a gradient 
pursuit algorithm. Following a similar direction, by exploiting 
inexact gradient information, we address the problem of accel¬ 
erating sparsity-inducing algorithms such as Matching Pursuit, 
Orthogonal Matching Pursuit and the Frank-Wolfe algorithm 
with an £1 ball constraint. However, unlike stochastic gradient 
approaches, the acceleration we propose leverages on the fact 
that at each iteration, these algorithms seek for the gradient’s 
component that has the largest (absolute) value. 
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Hence, our main contribution in this paper is to propose 
novel algorithms that allow to efficiently find this top entry 
of the gradient. By doing so, our objective is to design 
novel efficient versions of MR, OMP and FW algorithms 
while keeping intact all the properties of these algorithms 
for sparse approximation. Indeed, this becomes possible if at 
each iteration of MP, OMP or FW, our inexact gradient-based 
estimation of the component with largest value is the same as 
the one obtained with the exact gradient. 

We propose two approaches, the first one based on a greedy 
deterministic algorithm and the second one based on a random¬ 
ized method, whose aim is to build an inexact estimation of 
the gradient so that its top entry in the same as the exact one. 
Next, by casting the problem as a best arm identification multi¬ 
armed bandit problem llT4l . we are able to derive an algorithm 
that directly estimates the best component of the gradient. 
Interestingly, these algorithms are supported by theoretical 
evidences that with high probability, they are able to retrieve 
the correct component of the gradient. As a consequence, 
we show that MP, OMP and FW algorithms employing these 
approaches for spotting the correct component of the gradient 
behave as their exact counter part with high probability. 

This paper is an extended version of the conference paper 
m- It provides full details on the context of the problem and 
proposes a novel key contribution based on multi-arm bandits. 
In addition, it gives enlightning insights compared to related 
works. Extended experimental analysis also strengthen the 
results compared to the conference version. The remainder of 
the paper is organized as follows. Section [represents sparsity- 
constrained algorithms, formalizes our problem and introduces 
the key observation on which we build our acceleration strat¬ 
egy. Section [^ provides the different algorithms we propose 
for efficiently estimating the extreme gradient component of 
interest. A discussion with related works is given in Section 
IV Experimental results are depicted in Section [^ while 


Section VI concludes the work and presents different outlooks. 


II. Sparse Learning Algorithm with Extreme 
Gradient Component 

In this section, we introduce the sparse learning problem we 
are interested in, as well as some algorithms that are frequently 
used for sparse learning. We then point out the prominent 
common trait of those algorithms, namely the extreme gradient 
component property, and discuss how its estimation can be 
employed for accelerating some classes of sparse learning 
algorithms. 


A. Framework 

Consider the problem where we want to estimate a relation 
between a set of n samples gathered in a vector y G K" 
and the matrix X G In a sparse signal approximation 

problem, X would be a matrix whose columns are the elements 
of a dictionary and y the target signal, while in a machine 
learning problem, the i-th row of matrix X, formalized as , 
Xi G R'^, depicts the features of the Ath example and is the 
label or target associated to that example. In the sequel, we 
denote as Xij the entry of X at the i-th row and j-th column. 


Algorithm 1 Gradient Pursuit Algorithm 
I: set k = 0 , initialize wg = 0 
2: for k=0,l, • • • do 

3: 2* = argmaxj |VL(wfc)|j 

4: Efc = Efc U {2*} 

5: Wfe+i = argminL(w) over E^ with Wpc = 0 

6 : end for 


Our objective is to learn the relation between y and X 
through a linear model of the data denoted Xw by looking 
for the vector w that solves problem Q when the objective 
function is of the form 


■^(w) = Y^£{yi,g{w^Xi)), 

i 

where I is an individual loss function that measures the 
discrepancy between a true value y and its estimation, and 
g{-) is a given differentiable function that depicts the (po¬ 
tential) non-linear dependence of the loss to w. Typically, 
£ might be the least-square error function, which leads to 
L(w) = |||y — Xw||2 or the logistic loss and we have 
Li'^) = log(l + exp{—i/iX^w}). In the sequel, we 

present two algorithms that solve problem Q by a greedy 
method and by a continuous and a convex relaxation of the £q 
pseudo-norm, respectively. 


B. Algorithms 

1 ) Gradient Pursuit: This algorithm is a generalization of 
the greedy algorithm known as Orthogonal Matching Pursuit 
(OMP) to generic loss functions lEl. It can be directly applied 
to our problem given in Equation 0- Similarly to OMP, 
the gradient pursuit algorithm is a greedy iterative procedure, 
which, at each iteration, selects the largest absolute coordinate 
of the loss gradient. This coordinate is added to the set of 
already selected elements and the new iterate is obtained as 
the solution of the original optimization problem restricted to 
these selected elements while keeping the other ones at 0 . The 
detailed procedure is given in Algorithm 

While conceptually simple, this algorithm comes with theo¬ 
retical guarantees on its ability to recover the exact underlying 
sparsity pattern of the model, for different types of loss 
functions Q, ini, US). 

Several variations of this algorithm have been proposed 
ranging from methods exploring new pursuits directions in¬ 
stead of the gradient HU, to methods making only slight 
changes to the original one that have strongly impacted the 
ability of the algorithm to recover signals. Eor instance, the 
CoSaMP lfT9l and GraSP ll20l algorithms select the top 2 K 
entries in the absolute gradient, K being the desired sparsity 
pattern, optimize over these entries and the already selected 
K ones, and prune the resulting estimate to be AT-sparse. 
In addition to being efficient, these algorithms output sparse 
vectors whose distances from the true sparse optimum are 
bounded. 
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Algorithm 2 Frank-Wolfe Algorithm 

1 

set k = 0 , initialize 

Wo = 0 

2 

for k= 0 ,l, • • • do 


3 

Sfc = argminsgc 

sTVL(wfc) 

4 

dfc = Sfc — Wfc 


5 

set, linesearch or 

optimize 7fc G [ 0 ; 1 ] 

6 

Wfc+i = Wfc - 1 - 7 fcdfc 

7 

end for 



2 ) Frank-Wolfe algorithm: The Frank-Wolfe (FW) algo¬ 
rithm aims at solving the following problem 

min L(w) ( 2 ) 

wGC 

with w G L a convex and differentiable function with 
Lipschitz gradient and C a compact subset of The FW 
algorithm, given in Algorithm]^ is a straightforward procedure 
that solves problem]^ by first iteratively looking for a search 
direction and then updating the current iterate. The search di¬ 
rection Sfc is obtained from the following convex optimization 
problem (line Alg 

Sfc = argmins^VL(wfe), ( 3 ) 

sGC 

which may be efficiently solved, depending on the constraint 
set C. For achieving sparsity, we typically choose C as a 
fi-norm ball, e.g. C = {w : ||w||i < 1 }, which turns ([^ 
into a linear program. Despite its simplistic nature, the FW 
algorithm has been shown to be linearly convergent im, 
0 . Interestingly, it can also be shown that convergence is 
preserved as long as the following condition holds 

s^VL(wfe) < mins^VL(wfc) -f e, ( 4 ) 

sGC 

where e depends on the smoothness of L and the step-size 
7fc. In general cases where the minimization problem 0 is 
expensive to solve, this condition suggests that an approximate 
solution Sfc may be sufficient, provided that the true gradient 
is available. Similarly, if an inexact gradient information 
VL(wfc) is available, convergence is still guaranteed under 
the condition that sJVi(wfc) < minggp s^VL(wfc) + e, 
and some conditions relating WL{wk) and VL(wfc). For 
instance, if C is a unit-norm ball associated with some norm 
II • II and Sfc a minimizer of minggc VL(wfc), i.e. Sfc = 
argminsgc s^VL(wfc), then in order to ensure convergence, 
it is sufficient to have Sfc so that 0 

||VL(wfc)-VL(wfc)|U<e, ( 5 ) 

where || • ||* is the conjugate norm associated with |j • ||. 

Interestingly, the above equation provides a guarantee on the 
convergence on an inexact gradient Frank-Wolfe algorithm, but 
unfortunately, the precision needed on the inexact gradient is 
given with respect to the exact one. However, while condition 
0 is impractical, it conveys the important idea that inexact 
gradient computation may be sufficient for solving sparsity- 
constrained optimization problems. 


C. Leveraging from the extreme gradient component estima¬ 
tion 

As stated above, the gradient pursuit algorithm needs 
to solve at each iteration the following problem j* = 
argmaxj \WL{wk)\j, i.e., the goal is to find at each iteration 
the gradient’s component with the largest absolute value. 

In the Frank-Wolfe algorithm, similar situations where hnd- 
ing Sfc corresponds to looking for an extreme component of the 
(absolute) gradient, occur. For instance, when the constraint set 
is the £i-norm ball Ci — {w G : |lw||i < 1} or the posi¬ 
tive simplex constraint C2 = {w G : ||w||i = l,Wj > 0} 
and we denote 

s* = arg min s^VL(wfc) and j* = argmax |VL(wfc)Lj , 

sSCi j 

or 

s* = arg min s^VL(wfc) and j* = argmin VL(wfc)L-, 

SGC 2 j 

then s* = ej*, with ej* the j*-th canonical vector of 
Hence, as in Matching Pursuit, OMP or the gradient pursuit 
algorithm, for specific choices of C, we are not interested in 
the full gradient, nor in its extreme values, but only on the 
coordinate of the gradient component with the smallest, the 
largest (absolute) value. 

Our objective in this paper is to leverage on these observa¬ 
tions for accelerating sparsity-inducing algorithms which look 
for one extreme component of the gradient. In particular, we 
are interested in situations where the gradient is expensive 
to compute and we aim at providing computation-efficient 
algorithms that produce either approximated gradients whose 
extreme component of interest is the same as the exact 
gradient’s one, or identify the extreme component without 
computing the whole (exact) gradient. 

HI. Looking for the extreme gradient component 

This section formalizes the problem of identifying the 
extreme gradient component and provides different algorithms 
for its resolution. We first introduce a greedy approach, then 
a randomized one and finally exhibit how this problem can 
be cast as best arm identihcation in a multi-armed bandit 
framework. 

A. The problem 

As mentioned in Section [I^ we are interested in learning 
problems where the objective function is of the form 

i 

The gradient of our objective function is thus given by 

VL(w) = ^ ifpi, g(w^x*)))/(w^Xi)xi = X^r (6) 

i 

where r G K" is the vector whose i-th component is 
p(w^Xi)))p'(w^Xi). This particular form entails that 
the gradient may be computed iteratively. Indeed, the sum 
VL(w) = X)r=i is invariant to the order according to 
which index i decribes the set [l,...,n]. This makes possible 
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Algorithm 3 Greedy deterministic algorithm to compute VL 
Input: : r, {||x,|j}, X 

1 : [values, indices]=sort {|?'i|||xi||}i in decreasing order 

2: VLt = 0, t = 0 

3: repeat 

4: * =indices(t + 1) 

5: + Xjrj 

6: t = t “t- 1 

7: until stopping criterion over WLt+i is met 

Output: VLt+i 


the computation, at each iteration t, of an approximate gradient 
VLj. Denote Xt the hrst t indices used for computing the 
sum and ij+i the {t + l)-th index, then, at iteration t, 
WLt = J2ieXt 

VLt+i = VLt + (7) 

According to this framework, our objective is then to hnd an 
efficient way for computing an approximate gradient VLt for 
which the desired extreme entry is equal to the one of the exact 
gradient. Note that the computational cost of the exact gradient 
X^r is about Oind) and a naive computation of the residual 
r has the same cost, which leads to a global complexity of 
0{2nd). However, because w is typically a sparse vector in 
Gradient Pursuit or in the Frank-Wolfe algorithm, we compute 
the residual vector r as r = y — X^wn where H are the 
indices of the non-zero elements of the fc-sparse vector w. 
Hence, computing r has only a cost of 0{nk) and it does not 
require a full pass on all the elements of X. The main objective 
of our contributions is to estimate the extreme entry of X^r 
with algorithms having a complexity lower than 0{nd). We 
propose and discuss, in the sequel, three different approaches. 

B. Greedy deterministic approach 

The hrst approach we propose is a greedy approach which, 
at iteration t, looks for the best index i so that r^x^ optimizes 
a criterion depending on VLj and VL. 

Let Xf be the set of indices of the examples chosen in the 
hrst t iterations for computing VLj. At iteration t+1, our goal 
is to hnd the example i* that is the solution of the following 
problem: 

i* = argmin,g{i_ II VL - VL* - x,ri||, ( 8 ) 

with VLt+i = WLt + Xjr^. The solution of this problem 
is thus the vector product r^x^ that has minimal norm dif¬ 
ference compared to the current gradient estimation residual 
VL — VLt. While simple, this solution cannot be computed 
because VL is not accessible. Hence, we have to resort to an 
approximation based on the following equivalent problem: 

i* = argmaxtg{i -||VL - VLt+i|| (9) 

= argmax,g{i_..._„}\xt llVL - VLt|i - ||VL - VLt+i|| 

where we use the fact that ||VL —VLt|| does not depend on i. 
By upper bounding the above objective value, one can derive 


Algorithm 4 Computing VL as an empirical average 
Input: : r, ||xt|j,X 

1 : build p e a; e.g pt cx ^ or p* cx |ri|||xi|| 

2 : repeat 

3: random draw i G 1, ■ ■ ■ ,n according to p 

4: VLt+i = VLt -f 

5: f — f -f 1 

6 : until stopping criterion over VLt+i is met 

Output: VLt+i 


the best choice of r^Xt that achieves the largest variation of 
the gradient estimation norm residual. Indeed, we have 

||VL - VLtll - ||VL - VLt+ill < II - VLt + VLt+i|| 
The right-hand side of this equation can be further simplihed 
liVLt+i - VLtll = liVLt - x,r, - VLt|| 

= lixtrill = ||xi|||ri|. ( 10 ) 

Equation suggests that the index i that should be 
chosen at iteration f -f 1 is the one with the largest absolute 
residual weighted by its example norm. Simply put, in the 
hrst iteration, the algorithm chooses the index that leads to the 
largest value ||xi|| jr^l , in the second one, it selects the second 
best, and so forth. That is, the examples are considered in a 
decreasing order with respect to the weighted value of their 
residuals. Pseudo-code of the approach is given in Algorithm 

13 

Note that for this method, at iteration f = n we recover the 
exact gradient, 

V Lyj = VL. 

Hence, when f = n, we are assured to retrieve the correct 
extreme entry of the gradient at the expense of extra compu¬ 
tations needed for running this greedy approach compared to 
a plain computation of the full gradient. On the other hand, if 
we stop this gradient estimation procedure before f = n, we 
save computational efforts at the risk of missing the correct 
extreme entry. 

C. Matrix-Vector Product as Expectations 

The problem of hnding the extreme component of the 
gradient can also be addressed from the point of view of 
randomization, as described in Algorithmic 

The approach consists in considering the computation of 
X^r as an the expectation of a given random variable. Recall 
that X is composed of the vectors {x^ }'i=i x^ G 
Hence, the matrix-vector product X^r can be rewritten: 

n 

= ( 11 ) 

i=l 

From now on, given some integer n, A* denotes the interior 
of the probabilistic simplex of size n: 

= S P = [L 1 • • -Pn] ■ ^P^ = 1, Pi > 0,7 = 1,. . . ,n V . 


( 12 ) 
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For any element p = (Pi)iG[rt] S A* we introduce a random 
vector C that takes value in the set 


obtained from the fc-th pull of arm j G [1, • • • , d] as a random 
variable V, independent of k, that takes value in the set 


C = {ci = TiXijpi : i = 1,... ,n} 
so that P((7 = a) = Pi. This way, 

n n n 

EC = '^PiCi = '^pinxi/pi = = X^r (13) 

2=1 2=1 2=1 

Hence, if Ci, • • • , Cg are independent copies of C and C® is 
dehned as: C® = - Y'f- ^ Ci then EC® = X^r. C® is thus an 
estimator of the matrix-vector product that we are interested 
in i.e the gradient of our objective function. 

According to the above, a relevant approach for estimating 
the extreme component of the gradient is to randomly sample 
s copies of C, to average them and then to look for the extreme 
component of this estimated gradient. 

Interestingly, this approach based on randomized matrix 
multiplication can be related to our deterministic approach. 
Indeed, a result given by 1221 (Lemma 4) says that the element 
p G A* that minimizes E[||X^r — C®|||n] is such that 

Pj cx |ri|||xi||. (14) 

It thus suggests that vectors Ci of large values of |ri|||xi|| 
have higher probability to be sampled. This resonates with 
the greedy deterministic approach in which vectors Xjr^ are 
accumulated in the order of the decreasing values of |ri|||xi||. 


D. Best arm identification and multi-armed bandits 

The two preceding approaches seek at approximating the 
gradient, at a given level of accuracy that has yet to be defined, 
and then at evaluating the coordinate of its extreme component. 

Yet, the problem of hnding the extreme component co¬ 
ordinate of a gradient vector obtained from matrix-vector 
multiplication can also be posed as the problem of hnding 
the best arm in a multi-armed bandit problem. In a nutshell, 
given a slot machine with multiple arms, the goal in the bandit 
problem is to hnd the arm that maximizes the expected reward 
or minimizes the loss. For this, an iterative procedure is used, 
where at each step, a forecaster selects an arm, based on his 
previous actions, and receives a reward or observes a loss. 
Depending on how the reward is obtained, the problem can 
be stochastic (the reward/loss is drawn from a probability 
distribution) or non-stochastic. Bubeck et al. IH propose an 
extensive review of these methods for various settings. 

We cast our problem of hnding the extreme gradient com¬ 
ponent as a best arm identihcation problem as follows. In 
the remainder of this section, we suppose that we look for 
a minimum gradient component and thus we look for the arm 
with minimal loss instead of maximal reward. We consider 
that the arms are the components of the gradient (we have d 
arms) and at each pull of a given arm, we observe a loss that is 
built from a term of the gradient matrix-vector multiplication 
X^r, as made clear in the sequel. 

In a stochastic setting, we consider a similar framework 
as the one we described in Section UlI-CI We model the loss 


{vj,i = TiX^^^jpi :i = l,...,n} 


so that P(14 = Vj^i) = Pi- From this dehnition, the expected 
loss of arm i is 

n n 

EF = '^PiVj,i = '^riX,,j = (X^r)j, 

2=1 2=1 


which is the j-th component of our gradient vector. In this 
setting, given a certain number of pulls, one pull providing a 
realization of the random variable V of the chosen atm, the 
objective of a best arm identihcation algorithm is to provide 
recommendation about the atm with minimal expected loss, 
which in our case is the coordinate of smallest value in the 
d-dimensional vector X^r. 

Several algorithms for identifying this best arm have been 
provided in the literature. Most of them are built around an 
empirical average loss statistic. This latter can be computed, 
after s pulls of an arm j, as Vj^s — where 

i{k,j) is the index i drawn at the fc-th pull for arm j. Some 
of the most interesting algorithms are the successive reject 
and successive halving ll24l algorithms which, given a 
hxed budget of pulls, iteratively discard after some predehned 
number of pulls (say s) the worst arm or the worst-half 
arms, respectively, according to the values These 

approaches are relatively simple and the successive halving 
approach is depicted in detail in Algorithm They directly 
provide some guarantees on the probability to correctly es¬ 
timate the extreme component of the gradient. For instance, 
for a fixed budget of pulls T, under some minor and easily 
satished conditions on {vj^i}, the successive halving algorithm 
correctly identihes the minimum gradient component with 
probability at least 1 — 31og2(i • exp(—where H 2 
is a problem-dependent constant (the larger H 2 is, the harder 
the problem is). 

However, these two algorithms have the drawbacks to work 
on individual entries Xij of the matrix X. Hence, overload 
due to single memory accesses compared to those needed for 
accessing chunks of memory may hinder the computational 
gain obtained by identifying the component with minimal 
gradient value without computing the full gradient. 

For the successive halving algorithm, one way of overcom¬ 
ing this issue is to consider non i.i.d sampling of the arm’s 
loss. As such, we consider that at each iteration, the losses 
generated by pulling the remaining arms come from the same 
component j of the residual. This approach has the advantage 
of working on the full vector x^r^, allowing thus efficient 
memory caching, instead of on individual elements of X. 

Let As' and A denote the sets of components i that have 
been drawn after s' and the subsequent pulls such that 
s = s' -f respectively, then the loss for arm j after s pulls 
can be dehned as 


= Vj,s' 


r^Xi 


i&A 


2*^2,J 

Pi 


(15) 


where Vjp = 0 and the set A is the same for all arms. 
In practice, as implemented in the numerical simulations we 
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Algorithm 5 Successive Halving to find the minimum gradient 
component 

1 : inputs: X, r, set T the budget of pulls 

2 : Initialize iSg = [d] 

3: t>-0 = 0,Vj e 5o 

4: for i= 0,1, • • • , \log 2 {d)~\ -1 do 

5: Pull each arm in Sg for rg = L|s 7 ]ji^“(;rp J additional 

times and compute resulting loss {non-iid) in 

Equation ( [T5] l or {non-stochastic) in Equation ( [T 6 | ) 
with Ri = J2i=o 

6 : Sort arms in S'f by increasing value of losses 

7: Define as the [|S'f|/2j indices of arms with 

smallest values 

8 : end for 

9: output: index of the best arm 


provide, each component i of ^ is randomly chosen according 
to a uniform distribution over 1, • • • ,n. The non-iidness of 
the stochastic process comes from that the arm losses are 
dependent through the set A. However, this does not hinder 
the fact that empirical loss Vj^s is still a relevant estimation of 
the expected loss of arm j. 

Non-stochastic best arm identification has been barely stud¬ 
ied and only a very recent work has addressed this prob¬ 
lem ll25l . In this latter work, the main hypothesis about 
the non-stochasticity of the loss is that they are assumed 
to be generated before the game starts. This is exactly our 
situation since before each estimation of the minimum gradient 
component, all losses are given by Xi^jCi and thus can be 
computed beforehand. In this non-stochastic setting ll25l . the 
framework is that the fc-th pull of an arm j provides a loss 
Vj^k and the objective of the bandit algorithm is to identify 
argmin^ limfc_).oo assuming that such limits exist for 
all j. Again we can fit our problem of finding the extreme 
gradient component (here the minimum) into this framework 
by defining the loss for a given arm at pull k as 

To if fc = 0 

'^3,k = < if 1 < k < u (16) 

[ Vj^k -1 if k > n 

where r is a predefined or random permutation of the rows of 
the vector r and the matrix X, and its fc-th entry. In practice, 
we choose t to be the same for all the arms for computational 
reasons as explained above, but in theory this is not necessary 
ll^ . According to this loss definition, we have 

n n 

= ^rkXk,j = (X^rjj. 
k=l k=l 

Hence, an algorithm that recommends the best arm after a 
given number of pulls, will return the index of the minimum 
component in our gradient. Interestingly, the algorithm pro¬ 
posed by Jamieson et al. Il25l for solving the non-stochastic 
best arm identification problem is also the one used in the 
stochastic setting namely the successive halving algorithm 
(Alg. 0 - This algorithm can be shown to work as is despite 
the dependence between arm losses. Indeed, each round-robin 


pull of the surviving arms can have dependent values, and as 
long as the algorithm does not adapt to the observed losses 
during the middle of a round-robin pull ll25ll . 

As already mentioned, for a fixed budget T of pulls, this 
successive halving bandit algorithm comes with theoretical 
guarantee in its ability to identify the best arm. In the stochas¬ 
tic case, the probability of success depends on the number of 
arms, the number of pulls T and on a parameter denoting the 
hardness of the problem (see Theorem 4.1 in 1241). In the non¬ 
stochastic case, the budget of pulls needed for guaranteeing 
the correct recovery of the best arm essentially depends on a 
function 7 j(fc) such that \vj^k — liuife-^oo < lj{k) ^nd on 
a parameter denoting the gap between any of (X^r)j and the 
smallest component of X^r which is not accessible unless we 
compute the exact gradient. 

One may note the strong resemblance between the matrix- 
vector product approximation as given in ( [T3] l and the non-iid 
bandit strategy, as in this latter setting, we consider the full 
vector XjTi to compute 14 g for all remaining arms. This non- 
iid strategy can also be related to the non-stochastic setting if 
we choose r as a random permutation of 1,..., n. In addition, 
in the non-stochastic bandit setting, we can recover the greedy 
deterministic approach if we assume that the permutation t 
defines a re-ordering of ||xi|||ri| in decreasing order, then 
the accumulation given in Equation ( [T6| ) is exactly the one 
given in the greedy deterministic approach. This is the choice 
of T we have considered in our experiments. Multi-armed 
bandit framework and the gradient approximation approaches 
use thus similar ways for computing the criteria used for 
estimating the best arm. The main difference resides in the fact 
that with multi-armed bandit, one is directly provided with the 
estimation of the best arm. 

E. Stopping criteria 

In the greedy deterministic and randomized methods in¬ 
troduced in this section, we have no clues on how many 
elements r^x^ have to be accumulated in order to achieve 
a sufficient approximation of the gradient or in the multi¬ 
armed bandit approach how many pulls we need to draw. Here, 
we discuss two possible stopping criteria for the non-bandit 
algorithms; one that holds for any approach and a second 
one that holds only for the Erank-Wolfe algorithm in the 
deterministic sampling case. Discussion on the budgets that 
needs to be allocated to the bandit problem is also provided. 

1) Stability condition: Eor the sake of simplicity, we limit 
the exposition to the search of the smallest component of the 
gradient, although the approach can be generalized to other 
cases. 

Denote by j* the coordinate such that j* = 
argmin^ VL(wfe)|_, and let Tg be the maximal number 
of iterations or samplings allowed for computing the inexact 
gradient (for instance, in the greedy deterministic approach, 
Tg = n). Our objective is to estimate j* with the fewest 
number t of iterations. Eor this to be possible, we make the 
hypothesis that there exists an iteration ti, ti < Tg, and 

j* = argmin VLt(wfe)|j \/t : ti < t < Tg 
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in other words, we suppose that starting from a given number 
of iterations ti, the gradient approximation is sufficiently accu¬ 
rate so that the updates of the gradient will leave the minimum 
coordinate unchanged. Formally, this condition means that 
Vt S [fi, • ■ • j n], we have 

< 

where each Ut+i = ri(j^j^u)'^i(t+u), i{t + u) being an index of 
samples that depends how the greedy or randomized strategy 
considered. However, checking the above condition is as 
expensive as computing the full gradient, thus we propose an 
estimation of j* based on an approximation of this inequality, 
by truncating the sum to few iterations on each side. Basically, 
this consists in evaluating j* at each iteration and checking 
whether this index has changed over the last Ng iterations. We 
refer to this criterion as the stability criterion, parametrized by 

Ng. 

2) Error bound criterion: In Section |II-B2| we discussed 
that the convergence of the Frank-Wolfe inexact algorithm 
can be guaranteed as long as the norm difference between 
the approximate gradient and the exact one could be upper- 
bounded by some quantity e. Formally, this means that the 
iterations over gradient approximation can be stopped as soon 
as ||VLt — VL||oo < e, where e depends on the curvature of 
the function L(w) ||3. In practice, the criterion || VL^ —VL||oo 
cannot be computed as it depends on the exact gradient but 
it can be upper-bounded by a term that is accessible. For the 
greedy deterministic approach, by norm equivalence, we have 


Ui 


t-\-U 


Vj G [I,- - • ,d]. 


VL* 


Ts-t 

-YUi 


t-\-u 


u—0 


||VLt-VL||oo 




OO 


i^It 


Hence if the norms {||xi||oo} have been precomputed before¬ 
hand, this criterion can be easily evaluated at each gradient 
update iteration. 

3) Pull budget for the bandit: In multi-armed bandit algo¬ 
rithms, one typically specifies the number of pulls T available 
for estimating the best arm. As such, T can be considered 
as hyperparameter of the algorithm. A possible strategy for 
removing the dependency of the bandit algorithm to this pull 
budget, is to use the doubling trick ll25l . which consists in 
running the algorithm with a small value of T and then 
repeatedly doubling it until some stopping criterion is met. 
The algorithm can then be adapted so as to exploit the loss 
computations that have already been carried from iteration to 
iteration. However, this strategy needs a stopping criterion 
for the doubling trick. According to Theorem 1 in 1^ . 
there exists a lower bound of pulls for which the algorithm 
is guaranteed to return the best arm. Hence, the following 
heuristic can be considered; if T' and 2T' number of pulls 
return the same best arm, then we conjecture that the proposed 
arm is indeed the best one. One can note that this idea is 
similar to the above-described stability criterion. While this 
strategy is appealing, in the experiment, we have just fixed 
the budget of pulls T to a fixed predefined value. 


IV. Discussion 

This section provides comments and discussions of the 
approaches we proposed compared to existing works. 

A. Relation and gains compared to OMP and variants 

Several recent works on sparse approximation have pro¬ 
posed theoretically founded algorithms. These works include 
OMP lEl, Eg, greedy pursuit HSl, ||23, CoSaMP US) and 
several others like 1^ . Most of these algorithms make use of 
the top absolute entry of the gradient vector at each iteration. 
The work presented in this paper is strongly related to these 
ones as we share the same quest for the top entry. Indeed, 
the proposed methodology provides tools that can be applied 
to many sparse approximation algorithms including the afore¬ 
mentioned ones. What makes our work novel and compelling 
is that at each iteration, the gradient is computed with as few 
information as possible. If the stopping criterion for estimating 
this gradient is based on a maximal number of samples — e.g. 
we are interested in constructing the best approximation of the 
gradient from only 20% of the samples—, our approach can 
be interpreted as a method for computing the gradient on a 
limited budget. Hence, the proposed method allows to obtain 
a gain in the computational time needed for the estimation 
of the gradient. On the downside, if other stopping criteria 
are used (alone or jointly with the budget criterion), this gain 
may be partly impaired by further computations needed for 
their estimation. As an example, the stability criterion induces 
a 0{d) overhead at each iteration due to the max computation. 


B. Relation with other stochastic MP/FW approaches 

Some prior works from the literature are related to the 
approaches we have proposed in the present paper. Chen 
et al. 1 ^ have recently introduced a stochastic version of 
a Matching Pursuit algorithm in a Bayesian context. Their 
principal contribution was to define some prior distribution 
over each component of the vector w and then to sample over 
this distribution so as to estimate w. In their approach, the 
sparsity pattern related to Matching Pursuit is controlled by 
the prior distribution which is assumed to be a mixture of 
two distributions one of which induces sparsity. While this 
approach is indeed stochastic, it strongly differs from ours in 
the way stochasticity is in play. As we will discuss in the 
next subsection, our framework is more related to stochastic 
gradient than to the stochastic sampling of Chen et al. 

Stronger similarities with our work appear in the work of 
Peel et al. 1^ . Indeed, they propose to accelerate the atom 
selection procedure by randomly selecting a subset of atoms 
as well as a subset of example for computing X^r. This idea 
is also the base of our work. However, an essential difference 
appears as we do not select a subset of atoms. By doing so, we 
are ensured not to discard the top entry of X^r and thus we 
can guarantee for instance that our bandit approaches are able 
to retrieve this top entry with high probability given enough 
budget of pulls. 

Stochastic variants of the Frank-Wolfe algorithm have been 
recently proposed by Lacoste-julien et al. m and Ouyang 










et al. II 32 I . These works are mostly tailored for solving large- 
scale SVM optimization problem and do not focus on sparsity. 


C. About stochasticity 

The randomization approach for approximating the gradient, 
introduced in Section I1I-C[ involves random sampling of the 
columns. In the extreme situation where only a single column 
i is sampled, we thus have VT = x^ri, and the method 
we propose boils down to a stochastic gradient method. In the 
context of sparse greedy approximation, the first work devoted 
to stochastic gradient approximation has been recently released 
113. Nguyen et al. 03 show that their stochastic version 
of the iterative hard thresholding algorithm, or the gradient 
matching pursuit algorithm which aim at greedily solving a 
sparse approximation problem with arbitrary loss functions, 
behave properly in expectation. 

The randomized approach we propose in this work goes 
beyond the stochastic gradient method for greedy approxima¬ 
tion, since it also provides a novel approach for computing 
stochastic gradient. Indeed, we differ from their setting in 
several important aspects: 

• First, in our stochastic gradient approximation, we always 
consider a number of samples larger than 1. As such, we 
are essentially using a stochastic mini-batch gradient. 

• Second, the size of the mini-batch is variable (depending 
on the stopping criterion considered) and it depends on 
some heuristics that estimates on the fly the ability of the 
approximate gradient to retrieve the top entry of the true 
gradient. 

• Finally, one important component of our approach is 
the importance sampling used in the stochastic mini¬ 
batch sampling. This component theoretically helps in 
reducing the error of the gradient estimation ll33 . In the 
context of matrix multiplication approximation we used 


for developing the randomized approach in Section III-C 
theoretical results of Drineas et al. 


have also shown 
there exists an importance sampling that minimizes the 
expectation of the Frobenius norm of the matrix multipli¬ 
cation approximation. Our experiments corroborate these 
results showing that, compared to a uniform sampling, 
importance sampling clearly enhances the efficiency in 
retrieving the top entry of the true gradient. 

All these differences make our randomized algorithm not only 
clearly distinguishable from stochastic gradient approaches, 
but also harder to analyze. We thus defer for further work 
the theorical analyses of such a stochastic adaptive-size mini¬ 
batch gradient coupled with an importance sampling approach. 


D. Theoretical considerations 

Although a complete theoretical analysis of the algorithm is 
out of the scope of this paper, an interesting property deserves 
to be mentioned here. Note that, unlike stochastic gradient 
approaches, our algorithm is built upon inexact gradients that 
hopefully have the same minimum component as the true 
gradient. If this latter fact occurs along the iterations of the 


FW or OMP algorithms, then all the properties {e.g linear 
convergence, exact recovery property...) of these algorithms 
apply. Based on the probability of recovering an exact min¬ 
imum component of the gradient at each iteration, we show 
below a bound on the probability of our algorithm to recover 
at a given iteration K of OMP or FW, the same sequence of 
minimum components as the one obtained with exact gradient. 

Suppose that at each iteration t of OMP or FW, our 
algorithm for estimating the minimum component correctly 
identifies this component with a probability at least 1 — Pt and 
there exists P so that Pt > P, Vf < K, then the probability of 
identifying the correct sequence of minimum component is at 
least 1 — KP. We get this thanks to the following reasoning. 
Denote B{t) = {/^ = il,I^ = = i*} the event 

that our algorithm outputs the exact sequence of minimum 
components up to iteration t, P and being the coordinates 
retrieved with the inexact and exact gradient. Similarly, we 
note A{t) = {P = ig} the event of retrieving, at iteration 
t, the correct component of the gradient. We assume that 
P(P(0)) = 1. Formally, we are interested in lower-bounding 
the probability of B{K). By definition, we have 

P(P(A:)) = f'{A{K)\B(K - 1))¥{B{K - 1)) 

K 

= p(p(o))nn^wi^(i-i))- 

i=l 

Note that this equation captures all the time dependencies 
that occur during the FW or the OMP algorithm. Since 
¥{A{t)\B{t — 1)) > 1 — Pt, we have 

K 

nB{K)) > [](1 - Pt) > (1 - P)^ > (1 - KP) 

t^l 

where in the last inequality, we used the fact that (l — u)^ > 
(1 — Ku) for 0 < M < 1. 

For instance, in the successive halving algorithm, we have 
Pt = 3log 2 d-exp sH 2 (f)iog 2 d )^ where P 2 (i) is a iteration- 
dependent constant ll24l and T the number of pulls. Thus, if 
we dehne H so that Vf, H > H 2 {t), we have P = 31og2 d ■ 

exp (- mkpd) 

P(B(A-))>l-3A-bg,d.exp(-j^). 

We can see that the probability of our OMP or FW having 
the same behaviour as their exact counterpart decreases with 
the number K of iterations and the number d of dimensions 
of the problems and increases with the number of pulls. 
By rephrasing this last equation, we also get the following 
property. For S G [0,1], if the number T of pulls is set so that 
at each iteration t, 

T > ^loglog 2 ^ -f log y -f 8 Plog 2 d 

then, when using the successive halving algorithm for retriev¬ 
ing the extremum gradient component, an inexact OMP or FW 
algorithm behaves like the exact OMP or FW with probability 
1 — d. This last property is another emphasis on the strength of 
our inexact gradient method compared to stochastic gradient 
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descent approaches as it shows that with high probability, all 
the theoretical properties of OMP or FW (e.g convergence, 
exact recovery of sparse signal) apply. 

V. Numerical experiments 

In this section, we describe the experimental studies we have 
carried out for illustrating the computational benehts of using 
inexact gradient for sparsity-constraint optimization problems. 

A. Experimental setting 

In order to illustrate the beneht of using inexact gradient 
for sparse learning or sparse approximation, we have set up 
a simple sparse approximation problem which focuses on the 
computational gain, and for which a sparse signal has to be 
recovered by the Frank-Wolfe, OMP or CoSaMP algorithm. 

Note that sparse approximations are mostly used for ap¬ 
proximation problems on overcomplete dictionary. This is 
the case in our experiments, where the dimension d of the 
learning problem is in most cases larger than the number 
n of samples. We believe that if the signal or the image 
at hand to be approximated can be fairly approximated by 
representations for which fast transforms are available, then it 
is better (and faster) to indeed used this representation and the 
fast transform. Sparse approximation problems as considered 
in the sequel, mostly occur in overcomplete dictionary learning 
problems. In such a situation, as the dictionary is data-driven, 
we believe that the approach we propose is relevant. 

The target sparse signals are built as follows. For a given 
value of the dictionary size d and a number k of active 
elements in the dictionary, the true coefficient vector w* is 
obtained as follows. The k non-zero positions are chosen 
randomly, and their values are drawn from a zero-mean 
unit variance Gaussian distribution, to which we added ±0.1 
according to the sign of the values. The columns of the 
regression design matrix X G are drawn uniformly 

from the surface of a unit hypersphere of dimension n. Finally, 
the target vector is obtained as y = Xw* ± e, where e is a 
random noise vector drawn from a Gaussian distribution with 
zero-mean and variance Ug determined from a given signal- 
to-noise as = i||Xw*|p • unless specihed, 

the SNR ratio has been set to 3. For each setting, the results 
are averaged over 20 trials, and X, w* and e are resampled 
at each trial. 

The criteria used for evaluating and comparing the proposed 
approaches are the running time of the algorithms and their 
ability to recover the true non-zero elements of w*. The latter 
is computed through the F-measure between the support of the 
true coefficient vector w* and the estimated one w: 

„ „ |supp (w*)Usupp (w)| 

F-meas = 2------;-——■ 

|supp..,(w*)| ± |supp..^(w)| 

where, supp..^(w) = {j : \wj\ > 7 } is the support of vector w 
and 7 is a threshold used to neglect some non-zero coefficients 
that may be obliterated by the noise. In all our experiments, we 
have set 7 = 0.001 which is small compared to the minimal 
absolute value of a non-zero coefficient (0.1). 


All the algorithms (Frank-Wolfe, OMP and CoSaMP) and 
exact and inexact gradient codes have been written in Matlab 
except for the successive reject bandit which as been written in 
C and transformed in a mex hie. All computations have been 
run on each single core of an Intel Xeon E5-2630 processor 
clocked at 2.4 GHz in a Linux machine with 144 Gb of 
memory. 

B. Sparse learning using a Frank-Wolfe algorithm 

For this experiment, the constraint set C is the £i unit-ball 
and the loss function is L(w) = i||y —Xw|||. Our objectives 
are two-folded: 

1) analyze the capability of inexact gradient approaches to 
recover the true support and compare them to the FW 
algorithm, 

2) compare the two stopping criteria for computing the 
inexact gradient : the stability condition and the error 
bound condition. 

In the latter, while this error bound condition provides an 
adaptive condition for stopping — recall that the parameter e 
in Equation ( [T7] i is determined automatically through the data 
and the related curvature of the loss function —, the stability 
condition needs a user-dehned parameter Ns for stopping 
the accumulation of the partial gradient. In the same spirit, 
we use a hxed pre-dehned budget of pulls in the best-arm 
identihcation problem. This budget is given as a ratio of 
nx d. The exact gradient is computed using the accumulation 
strategy as given in Equation Q so as to make all running 
times comparable. The maximum number of iterations for EW 
is set to 5000. 

Eigure[2presents the results obtained for n = 2000 samples, 
d = 4000 dictionary elements and fc = 50 active atoms. We 
depict the running time and recovery abilities of the Erank- 
Wolfe algorithm with an exact gradient (exact), a greedy 
deterministic gradient sampling computation with a stability 
stopping criterion (deterministic) and an error bound stopping 
criterion (grad upb), the randomized approach with an uni¬ 
form sampling (uniform), and with a best probability sampling 
as given in Equation ( [T^ (best), the successive reject bandit 
(succ), the non-iid successive halving approach with losses 
computed as given in Equation ( [T5] l (SuccHalvSame) with a 
random uniform sampling and the non-stochastic successive 
halving approach with losses computed as given in Equation 
( [T6| ) (SuccHalvNonStoch) using a permutation t that dehnes 
a decreasing ordering of the ||xi|||ri|. 

The hgures depict the performances with respect to the 
stopping condition parameter Ng of the stability criterion (the 
hrst value in the bracket) and the sampling budget of the bandit 
approach where z is the second value in the bracket). 

Eirst, we can note that the deterministic approach used with 
any stopping criterion and the non-stochastic successive halv¬ 
ing approaches are able to perfectly recover the exact support 
of the true vector w*, regardless of the considered stopping 
criterion’s value. Randomized approaches with uniform and 
best probability sampling nearly achieve perfect recovery with 
an average E-measure of 0.975 with the stability criterion Ns 
equal to 5 for the uniform approach. When Ng increases, the 
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Fig. 1. Comparing vanilla Frank-Wolfe and inexact FW algorithms with different ways for computing the inexact gradient with n = 2000, d = 4000 and 
k = 50. Performances are compared with increasing precision on the inexact computations. We report the exact recovery performance and the running time 
of algorithms. 


performances of these two approaches also increase but still 
fail to achieve perfect recovery. 

From a running time point of view, the proposed approaches 
based on greedy deterministic and randomized sampling 
strategies with stability criterion and the successive halving 
strategies are faster than the exact FW approach, the plain 
successive reject method acting on single entry of {xj^iTj} 
and the deterministic method with the error bound condition. 
For instance, the greedy deterministic approach (green curve) 
achieves a gain in running time of a factor 2 with respect to 
the exact Frank-Wolfe algorithm. Interestingly for the greedy 
deterministic approach and the successive halving approaches, 
this gain is achieved without compromise on the recovery 
performance. For the randomized strategies, increasing the 
stability parameter Ng leads to a very slight increase of run¬ 
ning time, hence for these methods, a trade-off can eventually 
be found. When comparing bandit approaches, one can note 
the substantial gain in performances that can be obtained by 
the halving strategy, the non-iid and non-stochastic strategies 
compared to successive reject. We conjecture that this higher 
computational running time of the successive reject algorithm 
is essentially due to computational overhead needed for ac¬ 
cessing each single matrix entry in memory while all 
other methods use slices of this matrix (through the samples 
Xi) and thus they can leverage on the chunk of memory access. 
Best performances jointly in recovery and running times are 
achieved by the greedy deterministic and the non-stochastic 
successive halving approaches. 

When comparing the stability and the error bound stopping 
criteria, the latter one is rather inefficient. While grounded on 
theoretical analysis, this bound is loose enough to be non- 
informative. Indeed, a careful inspection shows that the error 
bound criterion accumulates about 5 times more elements 
TiXi than the stability one before triggering. In addition, other 
computational overheads necessary for the bound estimation, 
make the approach just as efficient as the exact Frank-Wolfe 
algorithm. 

In summary, from this experiment, we can conclude that the 
non-stochastic bandit approach is the most efficient one. It can 


achieve a gain in computation of about an order of magnitude 
(the left most point in the Figure [T]s right panel) without 
compromising accuracy. The greedy deterministic approach 
with stability criterion performs also very well but it is slightly 
less efficient. We can remark that these two best methods 
both use the same strategy of gradient accumulation based 
on decreasing ordering of ||xi|||ri|. 

C. Sparse Approximation with OMP 

Here, we evaluate the usefulness of using inexact gradient 
in a greedy framework like OMP. The toy problem is similar 
to the one used above except that we analyze the performance 
of the algorithm for an increasing number k of active atoms 
and two sizes of dictionary matrix X have been considered. 

The same ways for computing the inexact gradient are 
evaluated and compared in terms of efficiency and correctness 
to the true gradient in an OMP algorithm. For all sampling 
approaches, the stopping criterion for gradient accumulation 
is based on the stability criterion with the parameter Ng 
adaptively set at 2% of the number n of samples. For the 
successive reject bandit approach, the sampling budget has 
been limited to 20% of the number of entries (which is n ■ d 
in the matrix X. In all cases, the stopping criterion for the 
OMP algorithm is based on a fixed number of iterations and 
this number is the desired sparsity k. 

Results are reported in Figure They globally follow 
the same trend as those obtained for the Frank-Wolfe algo¬ 
rithm. First, note that in terms of support recovery, when 
the number of active atoms is small, the greedy determin¬ 
istic approach performs better than the randomized sampling 
strategies. Bandit approaches perform similar to the greedy 
deterministic method. As the number of active atoms increases, 
the bandit approaches succeed better in recovering the extreme 
component of the gradient while the deterministic approach 
is slightly less accurate. Note that for any value of k, the 
randomized strategies suffer more than the other strategies for 
recovering the true vector w* support. From a running time 
point of view, again, we note that the deterministic and non- 
iid successive halving bandit approaches seem to be the most 
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Fig. 2. Comparing OMP algorithm with different ways for computing the inexact gradient. The comparison holds for dictionary size with (top) n = 2000, 
d = 4000. (bottom) n = 5000, d = 10000 and for an exact recovery criterion (left) and running time (right). 


efficient methods. The gain in running time compared to the 
exact gradient OMP is slight but significant while it is larger 
when comparing with the successive reject algorithm. 

D. Sparse Approximation with CoSaMP 

To the best of our knowledge, there are very few greedy 
algorithms that are able to leverage from stochastic gradient. 
One of these algorithms has been introduced in ifTSl . In this 
experiment, we want to evaluate the efficiency gain achieved 
by our inexact gradient approach compared to this stochastic 
greedy algorithm. Our objective is to show that the approach 
we propose is empirically significantly faster than a pure 
stochastic gradient approach. For the different versions of 
the CoSaMP algorithm, we have set the stopping criterion as 
follows. For the CoSaMP with exact gradient approach, which 
serves only as a baseline for computing the exact solution, the 
number of iteration is set to the level of sparsity k of the 
target signal. A tolerance of the residual norm is also used 
as a stopping criterion which should be below 10“^. Next, 
for the stochastic and the inexact gradient CoSaMP versions, 
the algorithms were stopped when the norm of the residual 
(y — Xw) became smaller than 1.001 times the one obtained 
by the exact CoSaMP or when a maximal number of iterations. 
Regarding gradient accumulation, the stopping criterion we 
choose is based on the stability condition with the parameter 
Ns set dynamically at 2% of the number of samples. For 
the bandit approaches, we have fixed the budget of pulls at 


0.2n X d. 

Note that for the CoSaMP algorithm, we do not look for 
the top entry of the gradient vector but for the top 2k entries 
as such, we have thus straightforwardly adapted the successive 
halving algorithm to handle such a situation. 

Figure presents the observed results. Regarding support 
recovery, we remark that all approaches achieve performances 
similar to the exact CoSaMP. When few active atoms are in 
play, we can note that sometimes, the stochastic approach of 
lfT3]l fails to recover the support of w*. This occurs seldom 
but it happens regardless of the dictionary size we have 
experimented with. 

From a running time perspective, the results show that the 
proposed approaches are highly more efficient than the exact 
gradient approach and interestingly, they are faster than a pure 
gradient stochastic approach. One or two orders of magnitude 
can be gained depending on the level of sparsity of the signal 
to be recovered. This observation clearly depicts the trade-off 
that occurs in sparsity-constrained optimization problems in 
which the gradient computation and an approximation problem 
on a limited number of atoms are the major computational bur¬ 
dens (Lines 3 and 5 of Algo[2l- Indeed in a stochastic gradient 
approach, inexact gradient computations are very cheap but 
more approximation problems to be solved may be needed for 
achieving a desired accuracy. In the approaches we propose, 
the inexact gradient computation is slightly more expensive but 
we somehow “ensure” that it provides the correct information 
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Fig. 3. Comparing CosAMP and CosAMP algorithms with different ways for computing the inexact gradient. The comparison holds for different dictionary 
size n = 2000, d = 4000 and n = 5000, d = 10000. We report exact recovery performance and running time. 


needed by the CoSaMP algorithm. Hence, our approaches need 
less approximation problems to be solved, making them more 
efficient than a stochastic gradient approach. 

When comparing the efficiency of the proposed algorithms, 
the approach based on non-stochastic successive halving 
and the greedy deterministic approach are the most efficient 
especially as the number of active atoms grows. 

In the second experiment, for the successive halving al¬ 
gorithms, we analyze the effect of the pull budget on the 
running time and on the recovery performance. We consider 
the following setting of the problem: n = 5000, d = 10000 
and k = 50. Results are given in Figure We note that 
a budget ratio varying from 0.05 and 0.7 allows a good 
compromise between ability of recovering the true vector 
and gain in computational time, particularly for the non¬ 
stochastic successive halving method. As the budget of pulls 
decreases, both algorithms fail more frequently in recovery 
and in addition, the computational gain substantially reduces. 
This experiment suggests that one should not be too greedy 
and should allow sufficient amount of pulls. A budget ratio of 
0.2 or 0.3 for the bandit algorithms seems to be a good rule 
of thumb according to our experience. 

Our last experiment with CoSaMP demonstrates how the 


running time and the support recovery performance behave 
with an increasing number n of samples and afterwards 
with an increasing number of dictionary elements d. We 
have restricted our comparison to the exact and stochastic 
CoSaMP, and the CoSaMP variants based on the successive 
halving bandit algorithms and the greedy deterministic one 
(which are the most efficient among those we propose). The 
experimental setup, the stopping criterion for the CoSaMP 
algorithm as well as the stopping criterion for the gradient 
accumulation and pull budget are the same as above. Results 
are depicted in Figure As a sanity check, we note that 
recovery performances are almost similar for all algorithms 
with slightly worse performances for the stochastic CoSaMP 
and the non-iid bandit algorithm based CoSaMP. 

The computational time results show that all algorithms 
globally follow the same trend as the number of dictionary 
atoms or the number of samples increase. Recall that the 
computational complexity for the gradient computation is 
0{nd). For the bandit approaches, we use a fixed budget 
of pulls dependent on nd to compute the inexact gradient. 
Similarly, for the greedy deterministic approach, the number 
of accumulation (and the stability criterion) is proportional to 
the number n of samples and thus the gradient computation 
is a constant factor of nd. Hence, our findings, illustrated on 




























13 


dim = 5000 #Dic = 10000 



dim = 5000 #Dic = 10000 




’ ^ 1 

-H- Exact 

Stoch 1 ^ 

SuccHalv same i 

SuccHaiv NonStoch 


10 -^ 10 -^ 10 -^ 10 ° 
Budget Ratio 


Fig. 4. Analyzing the effect of the pull budget on the successive halving algorithm (left) recovery F-measure and (right) computational running time. The 
pull budget is defined as the budget ratio times n • d. Here n = 5000, d = 10000 and k = 50. 


Figure are somewhat natural since the main differences of 
running time essentially come from a constant factor. This 
factor is highly dependent on the problem but according to our 
numerical experiments, a ten-fold factor computational gain 
can be expected in many cases. 

E. Application to audio data 

We have compared the efficiency of the approaches we 
propose on a real signal processing application. The audio 
dataset we use is the one considered by Yaghoobi et al. 041 . 
This dataset is composed of an audio sample recorded from 
a BBC radio session which plays classical music. From that 
audio sample, 8192 pieces of signal have been extracted, 
each being composed of 1024 time samples. Details about 
the dataset can be found in 04l . From this dataset, we have 
learned 2048 dictionary atoms using the approach described in 
031 . Our objective is to perform sparse approximation of each 
of the 8192 audio pieces over the 2048 dictionary atoms using 
CoSaMP and we want to evaluate the running time and the 
approximation quality of a CoSaMP algorithm using an exact 
gradient computation (Exact), a stochastic gradient CoSaMP 
algorithm (Stoch k) and the CoSaMP variants with inexact 
gradient computations as we propose. The approximation error 
is measured as where y and y are respectively the 

true audio piece and its CoSaMP-based approximation. We 
thus want to validate that our approaches achieve similar 
approximation performance than CoSaMP while being faster. 
For all algorithms, the number of CoSaMP iterations is hxed 
to the sparsity pattern, here fixed to k = 10. Note that for 
the stochastic gradient approach, we have also considered a 
version with more iterations (Stoch 3k) . Results are gathered 
in Table and they are obtained as the averaged performance 
when approximating all the 8192 pieces of audio signal in the 
dataset. We can see that the inexact approaches we introduce 
lead to the best compromise between approximation error and 
running time. For instance, our successive halving algorithms 
achieve similar approximation errors than the exact CoSaMP 
but they are 3 times faster. At the contrary, the stochastic 


TABLE I 

Approximation performance results and running time for 
CoSaMP and variants, y and y respectively depicts the signal 
AND ITS resulting APPROXIMATION. RESULTS ARE AVERAGED OVER 
THE APPROXIMATION OF 4500 SIGNALS. Stock 3k DENOTES THE 
STOCHASTIC GRADIENT ALGORITHM THAT USED 3k ITERATIONS. 


Approaches 

IT 

-y 

y|| 

1 

Time (s) 

exact 

0.376 

± 

0.22 

0.164 

± 

0.01 

stoch k 

0.670 

± 

0.13 

0.026 

± 

0.00 

stoch 3k 

0.570 

± 

0.17 

0.076 

± 

0.01 

uniform 

0.351 

± 

0.21 

0.133 

± 

0.02 

deterministic 

0.361 

± 

0.22 

0.187 

± 

0.02 

SuccHalvSame 

0.371 

± 

0.22 

0.059 

± 

0.01 

SuccHalvNonStoch 

0.374 

± 

0.22 

0.064 

± 

0.01 


gradient CoSaMP approaches are efficient but lack in properly 
approximating target audio pieces. 

F. Benchmark classification problems 

We have also benchmarked our algorithms on real-world 
high-dimensional learning classification problems. These 
datasets are frequently used for evaluating sparse learning 
problems m, ii and more details about them can be 
found in these papers. Here, we considered CoSaMP as a 
learning algorithm and our objective is to validate the fact 
that the approaches we propose for computing approximate 
gradient are able to speed up computation time while achieving 
the same level of accuracy as the exact gradient. For the 
approximate gradient computations, we have considered the 
stability criterion with Ng = (ntrain being the number 

of training examples) for the deterministic and randomized 
approaches, and we have set the budget as 0.2ntrain x d for 
the bandit approaches. 

The protocol we have set up is the following. Training 
and test sets are obtained by randomly splitting the dataset 
in a 80% — 20% fold. For model selection, the training set 
is further split in two sets of equal size. The parameter we 
have cross-validated is the number of non-zero elements k 
in w. It has been selected among 10, 50,100, 250 so as to 
maximize the accuracy on the validation set. This value of k 
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dim = 2000 k = 100 dim = 2000 k = 100 
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Fig. 5. (top left and top right) Evaluating how the recovery capability behaves and how the computation time scales with the number of dictionary elements 
(with n = 2000 and k = 100). (bottom left and bottom right) Evaluation of the same criteria with respect to the number of samples (with d = 10000 and 
k = 400). 


has also been used as the maximal number of iterations for all 
the algorithms except for the stochastic ones. For these, we 
have reported accuracies and running times for a number of 
maximal iterations of k and 3k. 

Results averaged over 20 replicas of the training and test sets 
are reported in Table Stochastic approaches fail in learning 
a relevant decision function. We can note that our deterministic 
and randomized approaches are more efficient than the exact 
CoSaMP but are less accurate. On the other hand, our bandit 
approaches achieve nearly similar accuracy to CoSaMP while 
being at least 30 times faster. 

VI. Conclusions 

The methodologies proposed in this paper aim at accelerat¬ 
ing sparsity-constrained optimization algorithms. This is made 
possible thanks to the key observation that, at each iteration, 
only the component of the gradient with smallest or largest 
entry is needed, instead of the full gradient. By exploiting 
this insight, we proposed greedy algorithms, randomized ap¬ 
proaches and bandit-based best arm identification methods for 
estimating efficiently this top entry. Our experimental results 
show that the bandit and the greedy approaches seem to be 
the most efficient methods for this estimation. Interestingly, the 


bandit approaches come with guarantees that, given a sufficient 
number of draws, this top entry can be retrieved with high- 
probability. 

Future works will be geared towards gaining further theo¬ 
retical understandings on the good behaviour of the greedy 
approach, linking the number of iterations needed for the 
Frank-Wolfe algorithm to converge, with the quality of the 
gradient approximation in the greedy and randomized ap¬ 
proaches, analyzing the role of the importance sampling in 
the randomized methods. In addition, we plan to explore how 
this work can be extended to an online and/or distributed 
computation setting. 
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